Section 3 - The five verbs and select in more detail
The five verbs and their meaning
The dplyr package contains five key data manipulation functions, also called verbs:
select(), which returns a subset of the columns,
filter(), that is able to return a subset of the rows,
arrange(), that reorders the rows according to single or multiple variables,mutate(), used to add columns from existing data,
summarize(), which reduces each group to a single row by calculating aggregate measures.
If you want to find out more about these functions, consult the documentation by clicking on the functions above.
Choosing is not losing! The select verb
To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().
As an example, take the following call, that selects the variables var1 and var2 from the data frame df.
select(df, var1, var2)
You can also use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use both variable’s names as well as integer indexes. This call selects the four first variables except for the second one of a data frame df:
select(df, 1:4, -2)
select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.
# Load the hflights package
library(hflights)
hflights_df <- hflights[sample(nrow(hflights), 720), ]
hflights <- as_tibble(hflights_df)
# Print out a tbl with the four columns of hflights related to delay
head(select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay))## # A tibble: 6 x 4
## ActualElapsedTime AirTime ArrDelay DepDelay
## <int> <int> <int> <int>
## 1 113 92 -12 -5
## 2 254 235 -7 2
## 3 169 148 73 84
## 4 51 32 -11 -13
## 5 80 65 3 15
## 6 56 41 -3 1
# Print out the columns Origin up to Cancelled of hflights
head(select(hflights, 14:19))## # A tibble: 6 x 6
## Origin Dest Distance TaxiIn TaxiOut Cancelled
## <chr> <chr> <int> <int> <int> <int>
## 1 HOU ATL 696 7 14 0
## 2 IAH PDX 1825 5 14 0
## 3 IAH SLC 1195 13 8 0
## 4 IAH SHV 192 3 16 0
## 5 IAH MEM 468 5 10 0
## 6 HOU DAL 239 3 12 0
# Answer to last question: be concise!
head(select(hflights, 1:4, 12:21))## # A tibble: 6 x 14
## Year Month DayofMonth DayOfWeek ArrDelay DepDelay Origin Dest Distance
## <int> <int> <int> <int> <int> <int> <chr> <chr> <int>
## 1 2011 1 6 4 -12 -5 HOU ATL 696
## 2 2011 6 2 4 -7 2 IAH PDX 1825
## 3 2011 7 23 6 73 84 IAH SLC 1195
## 4 2011 9 9 5 -11 -13 IAH SHV 192
## 5 2011 12 28 3 3 15 IAH MEM 468
## 6 2011 1 30 7 -3 1 HOU DAL 239
## # ... with 5 more variables: TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## # CancellationCode <chr>, Diverted <int>
Bear in mind that select() does not modify the original dataset. Instead, select() returns a new, modified copy. Therefore, choosing is not losing! You should save the result to a new variable if you want to be able to use it later.
Helper functions for variable selection
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
starts_with("X") # every name that starts with "X",
ends_with("X") # every name that ends with "X",
contains("X") # every name that contains "X",
matches("X") # every name that matches "X", where "X" can be a regular expression,
num_range("x", 1:5) # the variables named x01, x02, x03, x04 and x05,
one_of(x) # every name that appears in x, which should be a character vector.Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.
# Print out a tbl containing just ArrDelay and DepDelay
head(select(hflights, ends_with(c('Delay'))))## # A tibble: 6 x 2
## ArrDelay DepDelay
## <int> <int>
## 1 -12 -5
## 2 -7 2
## 3 73 84
## 4 -11 -13
## 5 3 15
## 6 -3 1
# Print out a tbl as described in the second instruction, using both helper functions and variable names
head(select(hflights, UniqueCarrier, ends_with(c('Num')), starts_with(c('Cancel'))))## # A tibble: 6 x 5
## UniqueCarrier FlightNum TailNum Cancelled CancellationCode
## <chr> <int> <chr> <int> <chr>
## 1 DL 1204 N964DL 0 ""
## 2 CO 209 N38403 0 ""
## 3 OO 4497 N813SK 0 ""
## 4 XE 4682 N21129 0 ""
## 5 XE 4364 N13929 0 ""
## 6 WN 44 N676SW 0 ""
# Print out a tbl as described in the third instruction, using only helper functions.
head(select(hflights, ends_with("Time"), ends_with("Delay")))## # A tibble: 6 x 6
## DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
## <int> <int> <int> <int> <int> <int>
## 1 1244 1537 113 92 -12 -5
## 2 2137 2351 254 235 -7 2
## 3 1544 1733 169 148 73 84
## 4 2052 2143 51 32 -11 -13
## 5 1620 1740 80 65 3 15
## 6 1701 1757 56 41 -3 1
This exercise demonstrates the power of the helper functions. They can limit the amount of coding drastically. Beware of too obscure queries however; you still want your code to be clear for fellow data scientists!
Comparison to base R
To see the added value of the dplyr package, it is useful to compare its syntax with base R. Up to now, you have only considered functionality that is also available without the use of dplyr. The elegance and ease-of-use of dplyr is a great plus though.
# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, starts_with("Taxi"), Distance)
# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year, Month, DayOfWeek, DepTime, ArrTime)
# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, TailNum, starts_with("Taxi"))Section 4 - The second of five verbs: mutate
Mutating is creating
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:
mutate(df, z = x + y)
# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)
# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = 60 * Distance / AirTime)
# Print out g3
g3 %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Year | Month | DayofMonth | DayOfWeek | DepTime | ArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | ActualGroundTime | GroundTime | AverageSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011 | 1 | 6 | 4 | 1244 | 1537 | DL | 1204 | N964DL | 113 | 92 | -12 | -5 | HOU | ATL | 696 | 7 | 14 | 0 | 0 | 21 | 21 | 453.9130 | |
| 2011 | 6 | 2 | 4 | 2137 | 2351 | CO | 209 | N38403 | 254 | 235 | -7 | 2 | IAH | PDX | 1825 | 5 | 14 | 0 | 0 | 19 | 19 | 465.9574 | |
| 2011 | 7 | 23 | 6 | 1544 | 1733 | OO | 4497 | N813SK | 169 | 148 | 73 | 84 | IAH | SLC | 1195 | 13 | 8 | 0 | 0 | 21 | 21 | 484.4595 | |
| 2011 | 9 | 9 | 5 | 2052 | 2143 | XE | 4682 | N21129 | 51 | 32 | -11 | -13 | IAH | SHV | 192 | 3 | 16 | 0 | 0 | 19 | 19 | 360.0000 | |
| 2011 | 12 | 28 | 3 | 1620 | 1740 | XE | 4364 | N13929 | 80 | 65 | 3 | 15 | IAH | MEM | 468 | 5 | 10 | 0 | 0 | 15 | 15 | 432.0000 | |
| 2011 | 1 | 30 | 7 | 1701 | 1757 | WN | 44 | N676SW | 56 | 41 | -3 | 1 | HOU | DAL | 239 | 3 | 12 | 0 | 0 | 15 | 15 | 349.7561 | |
| 2011 | 5 | 8 | 7 | 1911 | 2139 | CO | 670 | N14250 | 268 | 248 | 20 | 11 | IAH | SFO | 1635 | 5 | 15 | 0 | 0 | 20 | 20 | 395.5645 | |
| 2011 | 7 | 16 | 6 | 651 | 913 | MQ | 3859 | N543MQ | 142 | 123 | -17 | -4 | IAH | ORD | 925 | 6 | 13 | 0 | 0 | 19 | 19 | 451.2195 | |
| 2011 | 9 | 18 | 7 | 1942 | 2029 | XE | 4706 | N14938 | 47 | 35 | 115 | 122 | IAH | LFT | 201 | 4 | 8 | 0 | 0 | 12 | 12 | 344.5714 | |
| 2011 | 5 | 14 | 6 | 1859 | 2223 | XE | 2438 | N14174 | 144 | 119 | 11 | -1 | IAH | CVG | 871 | 7 | 18 | 0 | 0 | 25 | 25 | 439.1597 |
Add multiple variables using mutate
So far you’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().
mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:
mutate(my_df, x = a + b, y = x + c)
# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)
# Add the three variables as described in the third instruction: m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)Session info
sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] hflights_0.1 ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0
## [5] kableExtra_1.0.1 knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4
## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1
## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0
## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0
## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1
## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4
## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0
## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1
## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0
## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11
## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5
## [34] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [37] rvest_0.3.2 colorspace_1.4-0 utf8_1.1.4
## [40] stringi_1.3.1 lazyeval_0.2.1 munsell_0.5.0
## [43] crayon_1.3.4